-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Selenium grid support #89
base: main
Are you sure you want to change the base?
Conversation
muodov
commented
Feb 6, 2023
•
edited
Loading
edited
- To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1203911614339686
…ollector into selenium-grid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I left you some comments.
@@ -0,0 +1,25 @@ | |||
const DEFAULT_USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, I like that you extracted this!
@@ -56,6 +56,7 @@ class CLIReporter extends BaseReporter { | |||
this.progressBar.total = data.urls; | |||
this.progressBar.tick({ | |||
site: data.site, | |||
finished: `${finished} / ${data.urls}`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is that different from :percent ? why do we need this number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On bigger crawls, percents change very slowly, and when something goes wrong (e.g. a deadlock or slow network), it is not always obvious. With this change, we will show the exact number of analyzed sites.
@@ -30,6 +30,7 @@ Available options: | |||
- `-r, --region-code <region>` - optional 2 letter region code. For metadata only | |||
- `-a, --disable-anti-bot` - disable simple build-in anti bot detection script injected to every frame | |||
- `--chromium-version <version_number>` - use custom version of Chromium (e.g. "843427") instead of using the default | |||
- `--selenium-hub <url>` - instead of running a local browser, request a remote browser at the provided Selenium hub endpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please also add it to other examples in the readme (below)
*/ | ||
function filterUrls(inputUrls, logFunction, outputPath) { | ||
return new Promise((resolveFilterUrls, rejectFilterUrls) => { | ||
asyncLib.filter(inputUrls, (item, filterCallback) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is performance here worth the the complexity of doing filtering async?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do a detailed profiling, but with hundreds of thousands URLs, this takes quite a lot of time (tens of seconds). Depending on the FS speed it could make a difference I think.
The latest async lib supports promises, so if/when we upgrade it it should look less ugly.
if (seleniumHub) { | ||
try { | ||
prefixedLog(`Getting remote browser...`); | ||
driver = await getRemoteDriver({seleniumHub, chromiumVersion, proxyHost}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is chromiumVersion
in the expected format here? Is selenium also expecting "revision" instead of regular "version" number? I understand that we are reusing --chromium-version
flag to pass it?
@@ -0,0 +1,111 @@ | |||
const {Builder} = require("selenium-webdriver"); | |||
const puppeteer = require('puppeteer'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have inconsistent quotes (' vs ") can we require ' in linter?
extraExecutionTimeMs, | ||
collectorFlags, | ||
}); | ||
let browserContext = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this selenium stuff doesn't feel right in here, this function is not about setting up a browser, but rather doing a single site crawl and dumping the data. I'd rather do it in the crawler.js/openBrowser - that's where we are setting up the browser. Additionally, it will help with some of the repetition (e.g. flags - '--enable-blink-features=InterestCohortAPI',)
const opts = new chrome.Options(); | ||
|
||
// default chrome arguments passed by puppeteer 10.2.0 | ||
opts.addArguments( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we extract those flags from puppeteer somehow? e.g. aren't they in some public constant?